1.  Summary


Over the past two years we significantly increased the recall of GeneWays pipeline processing and developed, tested, and applied tools for automated data cleaning (AI curation) of the resulting database. We also significantly increased (by more than 50%) the volume of textual information processed by the GeneWays pipeline, applied the AI curator tools to the newly generated database, and incorporated the automatically curated data into a new version of the GeneWays database.

2.  Introduction 

Picture  a  tribe  of  bright,  but  ignorant,  cave  people  trying  to  understand  the  work  of  a  modern  car 
by  analyzing  a  collection  of  damaged  cars  produced  by  various  makers.  After  many  hours  of  hard 
manual  labor,  the  cave  people  disassemble  the  cars  into  myriad  small  parts.  Some  parts  are 
damaged,  whereas  some  are  intact.  A  few  interact  with  each  other,  while  others  do  not.  Some 
pieces  are  different  in  different  cars,  yet  apparently  have  the  same  function.  The  leap  to 
understanding  the  whole  from  knowing  the  parts  requires  compilation  of  many  pieces  of 
information  into  a  comprehensive  “computable”  model.  Researchers  in  the  field  of  molecular 
biology are in a situation similar to that of the junkyard cave people, save that they are contemplating a collection of diverse pieces of cellular machinery. The number of those cellular components is far greater than the number of parts in a typical car: the number of nodes in human molecular networks is measured in hundreds of thousands when all substances (genes, RNAs, proteins, and other molecules) are considered together. These numerous substances can in turn be present or absent in dozens of cell types in humans. Clearly, the complexity is too great to yield to manual analysis.

The information overload in molecular biology is just one example of a condition common to all fields of current science and culture: an ever-growing avalanche of novel data and ideas overwhelms specialists and non-specialists alike, unavoidably fragments knowledge, and makes enormous chunks of knowledge invisible or inaccessible to those who desperately need them.

Help in relieving the information overload may come from text-miners, who can automatically extract and catalogue facts described in books and journals.
Figure 1. Cocaine: The predicted accuracy of individual text-mined facts involving the semantic relation stimulate.

In Figure 1, each directed arc from an entity A to an entity B should be interpreted as a statement “A stimulates B”, where, for example, A is cocaine and B is progesterone. The predicted accuracy of individual statements is indicated by both the color and the width of the corresponding arc. Note that, for example, the relation between cocaine and progesterone was derived from multiple sentences, and different instances of extraction output had markedly different accuracy. Altogether we collected 3,910 individual facts involving cocaine. Because the same fact can be repeated in different sentences, only 1,820 facts out of 3,910 were unique. The facts cover 80 distinct semantic relations, of which stimulate is just one example.




3.  Methods,  Assumptions,  and  Procedures 

Information  extraction  uses  computer-aided  methods  to  recover  and  structure  meaning  that  is 
locked  in  natural-language  texts.  The  assertions  uncovered  in  this  way  are  amenable  to 
computational  processing  that  approximates  human  reasoning.  In  the  special  case  of  biomedical 
applications,  the  texts  are  represented  by  books  and  research  articles,  and  the  extracted  meaning 
comprises  diverse  classes  of  facts,  such  as  relations  between  molecules,  cells,  anatomical 
structures,  and  maladies. 

Unfortunately,  the  current  tools  of  information  extraction  produce  imperfect,  noisy  results. 
Although  even  imperfect  results  are  useful,  it  is  highly  desirable  for  most  applications  to  have  the 
ability  to  rank  the  text-derived  facts  by  the  confidence  in  the  quality  of  their  extraction  (as  we  did 
for  relations  involving  cocaine,  see  Figure  1).  We  focus  on  automatically  extracted  statements 
about  molecular  interactions,  such  as  small  molecule  A  binds  protein  B,  protein  B  activates  gene  C, 
or protein D phosphorylates small molecule E. (In the following description we refer to phrases that represent biological entities, such as small molecule A, protein B, and gene C, as terms, and to biological relations between these entities, such as activate or phosphorylate, as relations or verbs.)

Several  earlier  studies  have  examined  aspects  of  evaluating  the  quality  of  text-mined  facts.  For 
example,  Sekimizu  et  al.  and  Ono  et  al.  attempted  to  attribute  different  confidence  values  to 
different  verbs  that  are  associated  with  extracted  relations,  such  as  activate,  regulate,  and  inhibit 
[1,2].  Thomas  et  al.  proposed  to  attach  a  quality  value  to  each  extracted  statement  about  molecular 
interactions  [3],  although  the  researchers  did  not  implement  the  suggested  scoring  system  in 
practice.  In  an  independent  study  [4],  Blaschke  and  Valencia  used  word-distances  between 
biological  terms  in  a  given  sentence  as  an  indicator  of  the  precision  of  extracted  facts.  In  our 
present  analysis  we  applied  several  machine-learning  techniques  to  a  large  training  set  of  98,679 
manually  evaluated  examples  (pairs  of  extracted  facts  and  corresponding  sentences)  to  design  a 
tool  that  mimics  the  work  of  a  human  curator  who  manually  cleans  the  output  of  an  information- 
extraction  program. 

Approach 

Our  goal  was  to  design  a  tool  that  could  be  used  with  any  information-extraction  system  developed 
for  molecular  biology.  In  this  study,  our  training  data  came  from  the  GeneWays  project 
(specifically,  GeneWays  6.0  database,  [5,6])  and  thus  our  approach  is  biased  toward  relationships 
that  are  captured  by  that  specific  system.  We  believe  that  the  spectrum  of  relationships  represented 
in  the  GeneWays  ontology  is  sufficiently  broad  that  our  results  will  prove  useful  for  other 
information-extraction projects.

Our  approach  followed  the  path  of  supervised  machine-learning.  First,  we  generated  a  large 
training  set  of  facts  that  were  originally  gathered  by  our  information-extraction  system,  and  then 
manually  labeled  as  “correct”  or  “incorrect”  by  a  team  of  human  curators.  Second,  we  used  a 
battery  of  machine-learning  tools  to  imitate  computationally  the  work  of  the  human  evaluators. 
Third,  we  split  the  training  set  into  ten  parts,  so  that  we  could  evaluate  the  significance  of 
performance  differences  among  the  several  competing  machine-learning  approaches. 
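
The ten-way split in the third step can be sketched as follows. This is only a minimal illustration under stated assumptions: the text does not say how the parts were drawn, so the random shuffle, the seed, and the helper names `split_into_folds` and `train_test_pairs` are hypothetical.

```python
import random


def split_into_folds(items, n_folds=10, seed=0):
    """Shuffle the items and deal them into n_folds nearly equal parts.

    Random shuffling with a fixed seed is an assumption of this sketch.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::n_folds] for i in range(n_folds)]


def train_test_pairs(folds):
    """Yield (training, held_out) pairs, holding out each fold exactly once."""
    for k, held_out in enumerate(folds):
        training = [x for j, fold in enumerate(folds) if j != k for x in fold]
        yield training, held_out


# Toy stand-in for the annotated fact-sentence pairs.
folds = split_into_folds(range(95), n_folds=10)
```

Each classifier can then be trained on nine parts and scored on the tenth, giving ten scores per method that can be compared across methods.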
Methods


Training  data 

With the help of a text-annotation company, For Science Inc., we generated a training set of approximately 45,000 multiple-annotated unique facts, or almost 100,000 independent evaluations. These facts were originally extracted by the GeneWays pipeline and then annotated by biology-savvy doctoral-level curators as “correct” or “incorrect,” referring to the quality of information extraction. Examples of automatically extracted relations, sentences corresponding to each relation, and the labels provided by three evaluators are shown in Table 1.


Table  1.  A  sample  of  sentences  that  were  used  as  an  input  to  automated  information  extraction 


Sentence [Source] | Extracted relation | Evaluation (Confidence)

1. NIK binds to Nck in cultured cells. [8] | nik bind nck | Correct (High)

2. One is that presenilin is required for the proper trafficking of Notch and APP to their proteases, which may reside in an intracellular compartment. [9] | presenilin required for notch | Correct (High)

3. Serine 732 phosphorylation of FAK by Cdk5 is important for microtubule organization, nuclear movement, and neuronal migration. [10] | cdk5 phosphorylate fak | Correct (High)

4. Histogram quantifying the percent of Arr2 bound to rhodopsin-containing membranes after treatment with blue light (B) or blue light followed by orange light (BO). [11] | arr2 bind rhodopsin | Correct (Low)

5. It is now generally accepted that a shift from monomer to dimer and cadherin clustering activates classic cadherins at the surface into an adhesively competent conformation. [12] | cadherin activate cadherins | Correct (Low)

6. Binding of G to CSP was four times greater than binding to syntaxin. [13] | csp bind syntaxin | Incorrect (Low)

7. Treatment with NEM applied with cGMP made activation by cAMP more favorable by about 2.5 kcal/mol. [14] | camp activate cgmp | Incorrect (Low)

8. This matrix is likely to consist of actin filaments, as similar filaments can be induced by actin-stabilizing toxins (O. S. et al., unpublished data). [15] | actin induce actin | Incorrect (High)

9. A ligand-gated association between cytoplasmic domains of UNC5 and DCC family receptors converts netrin-induced growth cone attraction to repulsion. [16] | cytoplasmic domains associate unc5 | Incorrect (High)
Table 1 shows a sample of sentences that were used as input to automated information extraction (the first column), the biological relations extracted from these sentences, either correctly or incorrectly (the second column), and the corresponding evaluations provided by three human experts (the third column). A high-confidence label corresponds to perfect agreement among all experts; a low-confidence label indicates that one of the experts disagreed with the other two. Clearly, automated information extraction can be associated with a loss of detail of meaning, as in the case of the cadherin activates cadherins example (sentence 5).

Each  extracted  fact  was  evaluated  by  one,  two,  or  three  different  curators.  The  complete  evaluation 
set  comprised  98,679  individual  evaluations  performed  by  four  different  people,  so  most  of  the 
statement-sentence  pairs  were  evaluated  multiple  times,  with  each  person  evaluating  a  given  pair 
at  most  once.  In  total,  13,502  statement/sentence  pairs  were  evaluated  by  just  one  person,  10,457 
by  two  people,  21,421  by  three  people,  and  57  by  all  four  people.  Examples  of  both  high  inter- 
annotator  agreement  and  low-agreement  sentences  are  shown  in  Table  1. 


Table 2. List of annotation choices available to the evaluators.


Term level:
    Upstream term is a junk substance
    Downstream term is a junk substance

Relation level:
    Action is incorrect biologically

Sentence level:
    Correctly extracted
    Sentence is hypothesis, not fact
    Unable to decide
    Incorrectly extracted
    Incorrect upstream
    Incorrect downstream
    Incorrect action type
    Missing or extra negation
    Wrong action direction
    Sentence does not support the action
    Wrong sentence boundary


In Table 2, the term “action” refers to the type of the extracted relation. For example, in the statement A binds B, “binds” is the action, “A” is the upstream term, and “B” is the downstream term. Action direction is defined as upstream to downstream, and a “junk substance” is an obviously incorrectly identified term or entity.

The  statements  in  the  training  data  set  were  grouped  into  chunks;  each  chunk  was  associated  with  a 
specific  biological  project,  such  as  analysis  of  interactions  in  Drosophila  melanogaster.  Pair-wise 
agreement  between  evaluators  was  high  (92%)  in  most  chunks,  with  the  exception  of  a  chunk  of 
5,271 relations where agreement was only 74%. These relatively low-agreement evaluations were not included in the training data for our analysis.

To facilitate evaluation, we developed a Sentence Evaluation Tool, implemented in the Java programming language by Mitzi Morris and Ivan Iossifov. This tool presented to an evaluator a set of annotation choices regarding each extracted fact; the choices are listed in Table 2. The tool also presented in a single window the fact itself and the sentence it was derived from. When a broader context was required for the judgment, the evaluator could retrieve the complete journal article containing the sentence by clicking a single button on the program interface.

For convenience in representing the results of manual evaluation, we computed an evaluation score for each statement as follows. Each sentence-statement score was computed as the sum of the scores assigned by individual evaluators: for each evaluator, -1 was added if the expert believed that the presented information was extracted incorrectly, and +1 was added if he or she believed that the extraction was correct. For a set of three experts, this method permits four possible scores: 3 (1,1,1), 1 (1,1,-1), -1 (1,-1,-1), and -3 (-1,-1,-1). Similarly, for just two experts, the possible scores are 2 (1,1), 0 (1,-1), and -2 (-1,-1).
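
The scoring rule can be written down in a few lines. The sketch below is illustrative only; the function names and the use of booleans to encode a curator's judgment are assumptions, not part of the original tool.

```python
def evaluation_score(votes):
    """Sum curator votes: True (extraction correct) adds +1, False adds -1."""
    return sum(1 if vote else -1 for vote in votes)


def possible_scores(n_curators):
    """All scores reachable with n_curators votes: -n, -n + 2, ..., n."""
    return list(range(-n_curators, n_curators + 1, 2))
```

For example, two "correct" votes and one "incorrect" vote give a score of 1, which corresponds to a low-confidence "correct" label in Table 1.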


Computational  methods 
Machine-learning  algorithms 
General  framework 

The objects that we want to classify, the fact-sentence pairs, have complex properties. We wanted to place each of the objects into one of two classes, correct or incorrect. In the training data, each extracted fact was matched to a unique sentence from which it was extracted, even though multiple sentences can express the same fact and a single sentence can contain multiple facts. The $i$-th object (the $i$-th fact-sentence pair) comes with a set of known features or properties that we encoded into a feature vector, $\mathbf{F}_i$:

$$\mathbf{F}_i = (f_{i,1}, f_{i,2}, \ldots, f_{i,n}). \tag{1}$$

In the following description we use $C$ to indicate the random variable that represents class (with possible values $c_{\text{correct}}$ and $c_{\text{incorrect}}$) and $\mathbf{F}$ to represent a $1 \times n$ random vector of feature values (also often called attributes), such that $F_j$ is the $j$-th element of $\mathbf{F}$. For example, for the fact p53 activates JAK, the feature that records dictionary membership of the upstream term would have value 1 because the upstream term p53 is found in a dictionary derived from the GenBank database [19]; otherwise, it would have value 0.
Full Bayesian inference

The full Bayesian classifier assigns the $i$-th object to the $k$-th class if the posterior probability $P(C = c_k \mid \mathbf{F} = \mathbf{F}_i)$ is greater for the $k$-th class than for any alternative class. This posterior probability is computed in the following way (a re-stated version of Bayes’ theorem):

$$P(C = c_k \mid \mathbf{F} = \mathbf{F}_i) = P(C = c_k) \times \frac{P(\mathbf{F} = \mathbf{F}_i \mid C = c_k)}{P(\mathbf{F} = \mathbf{F}_i)}. \tag{2}$$


In real-life applications, we estimate the probability $P(\mathbf{F} = \mathbf{F}_i \mid C = c_k)$ from the training data as the ratio of the number of objects that belong to class $c_k$ and have the same set of feature values as specified by the vector $\mathbf{F}_i$ to the total number of objects in class $c_k$ in the training data.

In other words, we estimate the conditional probability for every possible value of the feature vector $\mathbf{F}$ for every value of class $C$. Assuming that all features can be discretized, we have to estimate

$$(v_1 \times v_2 \times \cdots \times v_n - 1) \times m \tag{3}$$

parameters, where $v_i$ is the number of discrete values observed for the $i$-th feature and $m$ is the number of classes.

Clearly, even for a space of only 20 binary features the number of parameters that we would need to estimate is $(2^{20} - 1) \times 2 = 2{,}097{,}150$, which exceeds several times the number of data points in our training set.
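
The arithmetic behind this count is easy to check. The helper below is a small sketch (the function name is hypothetical) that evaluates Equation 3 for an arbitrary list of per-feature value counts.

```python
from math import prod


def full_bayes_parameter_count(values_per_feature, n_classes):
    """Equation 3: (v_1 * v_2 * ... * v_n - 1) * m."""
    return (prod(values_per_feature) - 1) * n_classes


# Twenty binary features and two classes, as in the example in the text.
count_20_binary = full_bayes_parameter_count([2] * 20, 2)
```

Because the product grows exponentially with the number of features, adding even a few more binary features multiplies the count several times over.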

Naive  Bayes  classifier 

The most affordable approximation to the full Bayesian analysis is the Naive Bayes classifier. It is based on the assumption of conditional independence of features:

$$P(\mathbf{F} = \mathbf{F}_i \mid C = c_k) = P(F_1 = f_{i,1} \mid C = c_k) \times P(F_2 = f_{i,2} \mid C = c_k) \times \cdots \times P(F_n = f_{i,n} \mid C = c_k). \tag{4}$$

We can estimate the probabilities $P(F_j = f_{i,j} \mid C = c_k)$ reasonably well with a relatively small set of training data, but the assumption of conditional independence (Equation 4) comes at a price: the Naive Bayes classifier is usually markedly less successful in its job than are its more sophisticated relatives.

In an application with $m$ classes and $n$ features (given that the $i$-th feature has $v_i$ admissible discrete values), a Naive Bayes algorithm requires estimation of $m \times \sum_{i=1}^{n} (v_i - 1)$ parameters (a value that, in our case, is equal to 4,208).
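
A count-based Naive Bayes classifier for discrete features might look like the sketch below. It is not the implementation used in the study (the study used its own code and WEKA); the function names are hypothetical, and the add-one smoothing is an assumption added here so that unseen feature values do not yield zero probabilities.

```python
from collections import Counter, defaultdict
from math import log


def train_naive_bayes(rows, labels, values_per_feature):
    """Collect the class counts and per-feature value counts used by Equation 4."""
    class_counts = Counter(labels)
    # feature_counts[(label, j)][value] = occurrences of value for feature j
    feature_counts = defaultdict(Counter)
    for row, label in zip(rows, labels):
        for j, value in enumerate(row):
            feature_counts[(label, j)][value] += 1
    return class_counts, feature_counts, values_per_feature


def log_posterior(model, row, label):
    """Unnormalized log posterior of label given row (add-one smoothing assumed)."""
    class_counts, feature_counts, values_per_feature = model
    score = log(class_counts[label] / sum(class_counts.values()))
    for j, value in enumerate(row):
        numerator = feature_counts[(label, j)][value] + 1
        denominator = class_counts[label] + values_per_feature[j]
        score += log(numerator / denominator)
    return score


def classify(model, row):
    """Return the class with the highest posterior for the given feature row."""
    return max(model[0], key=lambda label: log_posterior(model, row, label))


def naive_bayes_parameter_count(values_per_feature, n_classes):
    """m * sum_i (v_i - 1), the parameter count quoted in the text."""
    return n_classes * sum(v - 1 for v in values_per_feature)
```

With 20 binary features and two classes, `naive_bayes_parameter_count` gives 40, compared with 2,097,150 for the full Bayesian model.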

Middle  ground  between  the  full  and  Naive  Bayes:  Clustered  Bayes 

We can find an intermediate ground between the full and Naive Bayes classifiers by assuming that features in the random vector $\mathbf{F}$ are arranged into groups or clusters, such that all features within the same cluster are dependent on one another (conditionally on the class), and all features from different clusters are conditionally independent. That is, we can assume that the feature random vector ($\mathbf{F}$) and the observed feature vector for the $i$-th object ($\mathbf{F}_i$) can be partitioned into sub-vectors:

$$\mathbf{F} = (\Phi_1, \Phi_2, \ldots, \Phi_M), \text{ and} \tag{5}$$

$$\mathbf{F}_i = (\mathbf{f}_{i,1}, \mathbf{f}_{i,2}, \ldots, \mathbf{f}_{i,M}), \tag{6}$$

respectively, where $\Phi_j$ is the $j$-th cluster of features, $\mathbf{f}_{i,j}$ is the set of values for this cluster with respect to the $i$-th object, and $M$ is the total number of clusters of features.

The Clustered Bayes classifier is based on the following assumption about conditional independence of clusters of features:

$$P(\mathbf{F} = \mathbf{F}_i \mid C = c_k) = P(\Phi_1 = \mathbf{f}_{i,1} \mid C = c_k) \times P(\Phi_2 = \mathbf{f}_{i,2} \mid C = c_k) \times \cdots \times P(\Phi_M = \mathbf{f}_{i,M} \mid C = c_k). \tag{7}$$

We tested two versions of the Clustered Bayes classifier: one version used all 68 features (Clustered Bayes 68) with a coarser discretization of feature values; another version used a subset of 44 features (Clustered Bayes 44) but allowed for more discrete values for each continuous-valued feature, see legend to Figure 2.
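
One way to see how clustering sits between the two extremes is to count parameters. The sketch below assumes that each cluster is modeled by a full joint table, as in the full Bayesian case, and that the tables of different clusters are simply added up; this counting rule and the function name are assumptions of the sketch, not a formula stated in the text.

```python
from math import prod


def clustered_bayes_parameter_count(clusters, n_classes):
    """Count parameters when every cluster gets its own full joint table.

    clusters is a list of lists; each inner list holds the number of
    discrete values of the features in one cluster.
    """
    return n_classes * sum(prod(cluster) - 1 for cluster in clusters)


# Twenty binary features arranged three different ways.
all_singletons = clustered_bayes_parameter_count([[2]] * 20, 2)    # Naive Bayes limit
four_clusters = clustered_bayes_parameter_count([[2] * 5] * 4, 2)  # intermediate
one_cluster = clustered_bayes_parameter_count([[2] * 20], 2)       # full Bayes limit
```

Under this rule, single-feature clusters reproduce the Naive Bayes count and one all-encompassing cluster reproduces the full Bayesian count.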
Figure 2. The correlation matrix for the features used by the classification algorithms.

In Figure 2, the half-matrix below the diagonal was derived from analysis of the whole GeneWays 6.0 database; the half-matrix above the diagonal represents a correlation matrix estimated from only the manually annotated data set. The white dotted lines outline clusters of features, suggested by analysis of the annotated data set; we used these clusters in implementation of the Clustered Bayes classifier. We used two versions of the Clustered Bayes classifier: one with all 68 features (Clustered Bayes 68), and one with a subset of only 44 features but a higher number of discrete values allowed for non-binary features (Clustered Bayes 44). The Clustered Bayes 44 classifier did not use features 1, 6, 7, 8, 9, 12, 27, 28, 31, 34, 37, 40, 42, 47, 48, 49, 52, 54, 55, 60, 62, 63, and 65.

Linear  and  quadratic  discriminants 

Another method that can be viewed as an approximation to full Bayesian analysis is Discriminant Analysis, invented by Sir Ronald A. Fisher [20]. This method requires no assumption about conditional independence of features; instead, it assumes that the conditional probability $P(\mathbf{F} = \mathbf{F}_i \mid C = c_k)$ is a multivariate normal distribution:

$$P(\mathbf{F} = \mathbf{F}_i \mid C = c_k) = (2\pi)^{-n/2} \, |\mathbf{V}_k|^{-1/2} \exp\left(-\tfrac{1}{2} (\mathbf{F}_i - \boldsymbol{\mu}_k) \mathbf{V}_k^{-1} (\mathbf{F}_i - \boldsymbol{\mu}_k)^{\mathsf{T}}\right), \tag{8}$$

where $n$ is the total number of features/variables in the class-specific multivariate distributions. The method has two variations. The first, Linear Discriminant Analysis (LDA), assumes that different classes have different mean values for features (vectors $\boldsymbol{\mu}_k$), but the same variance-covariance matrix, $\mathbf{V}_k = \mathbf{V}$ for all $k$ (see Suppl. Note 7). In the second variation, Quadratic Discriminant Analysis (QDA), the assumption of a common variance-covariance matrix for all classes is relaxed, such that every class is assumed to have a distinct variance-covariance matrix, $\mathbf{V}_k$.

In this study we present results for QDA; the difference from the linear discriminant analysis was insignificant for our data (not shown). In terms of the number of parameters to estimate, QDA uses only two symmetrical class-specific covariance matrices and the two class-specific mean vectors. For 68 features the method requires estimation of $2 \times (68 \times 69)/2 + 2 \times 68 = 4{,}828$ parameters.
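
The parameter count quoted above can be reproduced with a short helper. The function name and the `shared_covariance` switch are hypothetical; the shared-covariance branch simply applies the stated LDA assumption of a single common matrix and is not a figure reported in the text.

```python
def discriminant_parameter_count(n_features, n_classes, shared_covariance=False):
    """Free parameters of Gaussian discriminant analysis.

    A symmetric n x n covariance matrix has n * (n + 1) / 2 free entries;
    each class also contributes a mean vector with n entries.
    """
    covariance_entries = n_features * (n_features + 1) // 2
    n_covariances = 1 if shared_covariance else n_classes
    return n_covariances * covariance_entries + n_classes * n_features


qda_count = discriminant_parameter_count(68, 2)
```

For 68 features and two classes this evaluates to 4,828, matching the value given for QDA.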

Maximum-entropy  method 

The current version of the maximum-entropy method was formulated by E.T. Jaynes [21,22]; the method can be traced to earlier work by J. Willard Gibbs. The idea behind the approach is as follows. Imagine that we need to estimate a probability distribution from an incomplete or small data set. This problem is the same as that of estimating the probability of the class given the feature vector, $P(C = c_k \mid \mathbf{F} = \mathbf{F}_i)$, from a relatively small training set. Although we have no hope of estimating the distribution completely, we can estimate with sufficient reliability the first (and, potentially, the second) moments of the distribution. Then, we can try to find a probability distribution that has the same moments as our unknown distribution and the highest possible Shannon’s entropy. The intuition behind this approach is that the maximum-entropy distribution will minimize unnecessary assumptions about the unknown distribution. The maximum-entropy distribution with constraints imposed by the first-order feature moments alone (the mean values of features) is known to have the form of an exponential distribution [23];
$$P(C = c_k \mid \mathbf{F} = \mathbf{F}_i) = \frac{\exp\left(-\sum_{j=1}^{n} \lambda_{j,k} f_{i,j}\right)}{\sum_{l=1}^{m} \exp\left(-\sum_{j=1}^{n} \lambda_{j,l} f_{i,j}\right)}, \tag{9}$$


and the maximum-entropy distribution for the case when both the first- and the second-order moments of the unknown distribution are fixed has the form of a multidimensional normal distribution [23]. The conditional distribution that we are trying to estimate can be written in the following exponential form:


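$$P(C = c_k \mid \mathbf{F} = \mathbf{F}_i) = \frac{\exp\left(-\sum_{j=1}^{n} \lambda_{j,k} f_{i,j} - \sum_{x=1}^{n} \sum_{y=x}^{n} \nu_{x,y,k} f_{i,x} f_{i,y}\right)}{\sum_{l=1}^{m} \exp\left(-\sum_{j=1}^{n} \lambda_{j,l} f_{i,j} - \sum_{x=1}^{n} \sum_{y=x}^{n} \nu_{x,y,l} f_{i,x} f_{i,y}\right)}. \tag{10}$$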


Parameters $\lambda_{j,k}$ and $\nu_{x,y,k}$ are class-specific weights (for the $k$-th class) of individual features and feature pairs, respectively, and in principle can be expressed in terms of the first and second moments of the distributions. The values of parameters in Equations 9 and 10 are estimated by maximizing the product of probabilities for the individual training examples.


We  tested  two  versions  of  the  maximum-entropy  classifier.  MaxEnt  1  uses  only  information  about 
the  first  moments  of  features  in  the  training  data  (Equation  9);  MaxEnt  2  uses  the  set  of  all 
individual  features  and  the  products  of  feature  pairs  (Equation  10).  To  select  the  most  informative 
pairs  of  features  we  used  a  mutual  information  approach,  as  described  in  the  subsection  dealing 
with  classification  features. 


For two classes (correct and incorrect) and 68 features, MaxEnt 1 requires estimation of 136 parameters. In contrast, MaxEnt 2 requires estimation of 4,828 parameters: weight parameters for all first moments for two classes, plus weights for the second moments for two classes. MaxEnt 2-v is a version of the MaxEnt 2 classifier where the squared values of features are not used, so that the classifier requires estimation of only 4,692 weight parameters.
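
The exponential form of the maximum-entropy posterior, and the three parameter counts above, can be illustrated with the following sketch. It assumes the negative-exponent sign convention and a pair index running over $x \le y$; the function names and the dictionary layout for pair weights are hypothetical, and weight estimation (the actual training step) is not shown.

```python
from itertools import combinations_with_replacement
from math import exp


def maxent_posterior(features, lam, nu=None):
    """Return the posterior probability of every class for one feature vector.

    lam[k][j] weights feature j for class k (first-order model).
    nu[k][(x, y)], with x <= y, weights the product of features x and y for
    class k (second-order model); leave nu as None for the first-order model.
    """
    energies = []
    for k, class_weights in enumerate(lam):
        energy = sum(w * f for w, f in zip(class_weights, features))
        if nu is not None:
            energy += sum(v * features[x] * features[y]
                          for (x, y), v in nu[k].items())
        energies.append(energy)
    # Shift by the smallest energy before exponentiating, for numerical stability.
    lowest = min(energies)
    unnormalized = [exp(-(e - lowest)) for e in energies]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


def feature_pairs(n_features, include_squares=True):
    """Index pairs (x, y) with x <= y, optionally dropping the squares x == y."""
    pairs = list(combinations_with_replacement(range(n_features), 2))
    if not include_squares:
        pairs = [(x, y) for x, y in pairs if x != y]
    return pairs


maxent1_count = 2 * 68
maxent2_count = 2 * (68 + len(feature_pairs(68)))
maxent2v_count = 2 * (68 + len(feature_pairs(68, include_squares=False)))
```

The three counts evaluate to 136, 4,828, and 4,692, in agreement with the figures quoted for MaxEnt 1, MaxEnt 2, and MaxEnt 2-v.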

Feed-forward neural network

A typical feed-forward artificial neural network is a directed acyclic graph organized into three (or more) layers. In our case, we chose a three-layered network, with a set of nodes of the input layer, $\{x_i\}_{i=1,\ldots,N_x}$, nodes of the hidden layer, $\{y_j\}_{j=1,\ldots,N_y}$, and a single node representing the output layer, $z_1$ (see Figure 3). The number of input nodes, $N_x$, is determined by the number of features used in the analysis (68 in our case). The number of hidden nodes, $N_y$, determines both the network’s expressive power and its ability to generalize. Too small a number of hidden nodes makes a simplistic network that cannot learn from complex data. Too large a number makes a network that tends to overtrain: it works perfectly on the training data, but poorly on new data. We experimented with different values of $N_y$ and settled on $N_y = 10$.

The values of the input nodes, $\{x_i\}$, are the feature values of the object that we need to classify.

The value of each node, $y_j$, in the hidden layer is determined in the following way:

$$y_j = F(w_{j,1} x_1 + w_{j,2} x_2 + \cdots + w_{j,N_x} x_{N_x}), \tag{11}$$

where $F(x)$ is a hyperbolic tangent function that creates an S-shaped curve:

$$F(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}, \tag{12}$$

and $\{w_{j,k}\}$ are weight parameters. Finally, the value of the output node, $z_1$, is determined as a linear combination of the values of all hidden nodes:

$$z_1 = a_1 y_1 + a_2 y_2 + \cdots + a_{N_y} y_{N_y}, \tag{13}$$

where $\{a_j\}$ are additional weight parameters. We trained our network, using a back-propagation algorithm [24], to distinguish two classes, correct and incorrect, where positive values of $z_1$ corresponded to the class correct.

The feed-forward neural network that we used in our analysis can be thought of as a model with $N_x \times N_y + N_y$ parameters (690 in our case).
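
A forward pass through such a network can be sketched in a few lines. The sketch assumes a network without bias terms, which is what the parameter count of 690 for 68 inputs and 10 hidden nodes implies; the function names are hypothetical, and training by back-propagation is not shown.

```python
from math import tanh


def forward(x, hidden_weights, output_weights):
    """Compute the output node value for one input vector.

    hidden_weights[j][i] multiplies input i when computing hidden node j;
    output_weights[j] multiplies hidden node j in the final linear sum.
    """
    hidden = [tanh(sum(w * xi for w, xi in zip(row, x)))
              for row in hidden_weights]
    return sum(a * y for a, y in zip(output_weights, hidden))


def predict(x, hidden_weights, output_weights):
    """Positive output values correspond to the class correct."""
    z = forward(x, hidden_weights, output_weights)
    return "correct" if z > 0 else "incorrect"


def parameter_count(n_inputs, n_hidden):
    """One weight per input-hidden edge plus one per hidden-output edge."""
    return n_inputs * n_hidden + n_hidden
```

Here `parameter_count(68, 10)` returns 690, the figure quoted for the network used in the analysis.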
Figure 3. A hypothetical three-layered feed-forward neural network.

Figure 3 shows a hypothetical network; in our analysis we used a similar network with 68 input units (one unit per classification feature) and 10 hidden-layer units.

Support  vector  machines 

The Support Vector Machines (SVM, [25,26]) algorithm solves a binary classification problem by dividing two sets of data geometrically, by finding a hyperplane that separates the two classes of objects in the training data in an optimum way (maximizing the margin between the two classes).

The SVM is a kernel-based algorithm, where the kernel is an inner product of two feature vectors (function/transformation of the original data). In this study, we used three of the most popular kernels: the linear, polynomial, and Rbf (radial basis function) kernels. The linear kernel, $K^{\text{linear}}(\mathbf{x}_1, \mathbf{x}_2) = \langle \mathbf{x}_1, \mathbf{x}_2 \rangle$, is simply the inner product of the two input feature vectors; an SVM with the linear kernel searches for a class-separating hyperplane in the original space of the data. Using a polynomial kernel, $K^{\text{poly}}(\mathbf{x}_1, \mathbf{x}_2) = (1 + \langle \mathbf{x}_1, \mathbf{x}_2 \rangle)^d$, is equivalent to transforming the data into a higher-dimensional space and searching for a separating plane there. Finally, using an Rbf kernel, $K^{\text{rbf}}(\mathbf{x}_1, \mathbf{x}_2) = \exp(-g \lVert \mathbf{x}_1 - \mathbf{x}_2 \rVert^2)$, corresponds to finding a separating hyperplane in an infinite-dimensional space.
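
The three kernels can be written as plain functions of two feature vectors. This is a sketch for illustration, not the SVM Light or OSU SVM code; in particular, the Rbf kernel is assumed to take the usual form exp(-g * squared Euclidean distance), with g the kernel parameter listed in Table 3, and the function names are hypothetical.

```python
from math import exp


def dot(x1, x2):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x1, x2))


def linear_kernel(x1, x2):
    """Inner product in the original feature space."""
    return dot(x1, x2)


def polynomial_kernel(x1, x2, d):
    """(1 + <x1, x2>) raised to the power d."""
    return (1 + dot(x1, x2)) ** d


def rbf_kernel(x1, x2, g):
    """exp(-g * ||x1 - x2||^2); this standard form is an assumption here."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x1, x2))
    return exp(-g * squared_distance)
```

An SVM never needs the transformed vectors themselves, only these kernel values between pairs of training examples, which is why the infinite-dimensional Rbf case remains computable.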
In most real-world cases the two classes cannot be separated perfectly by a hyperplane, and some classification errors are unavoidable. SVM algorithms use the C-parameter to control the error rate during the training phase (if the error is not constrained, the margin of every hyperplane can be extended infinitely). In this study, we used the default values for the C-parameter suggested by the SVM Light tool. Table 3 lists the SVM models and C-parameter values that we used in this study.


Table  3.  Parameter  values  used  for  various  SVM  classifiers  in  this  study. 


Model | Kernel | Kernel parameter | C-parameter

SVM (OSU SVM) | Linear | - | 1

SVM-t0 (SVM Light) | Linear | - | 1

SVM-t1-d2 | Polynomial | d = 2 | 0.3333

SVM-t1-d3 | Polynomial | d = 3 | 0.1429

SVM-t2-g0.5 | Rbf | g = 0.5 | 1.2707

SVM-t2-g1 | Rbf | g = 1 | 0.7910

SVM-t2-g2 | Rbf | g = 2 | 0.5783

The output of an SVM analysis is not probabilistic, but there are tools to convert an SVM classification output into “posterior probabilities”; see the chapter by J. Platt in [27]. (A similar comment is applicable to the artificial neural network.)

The number of support vectors used by the SVM classifier depends on the size and properties of the training data set. The average number of (1 × 68-dimensional) support vectors used in 10 cross-validation experiments was 12,757.5, 11,994.4, 12,092, 12,289.9, 12,679.7, and 14,163.8, for the SVM, SVM-t1-d2, SVM-t1-d3, SVM-t2-g0.5, SVM-t2-g1, and SVM-t2-g2 classifiers, respectively. The total number of data-derived values (which we loosely call “parameters”) used by the SVM in our cross-validation experiments was therefore, on average, between 827,614 and 880,270 for various SVM versions.
Table 4. Machine learning methods used in this study and their implementations.


Method | Implementation | URL | Number of parameters

Naive Bayes | this study, WEKA | http://www.cs.waikato.ac.nz/ml/weka/ | 4,208

Clustered Bayes 68 | this study | N/A | 276,432

Clustered Bayes 44 | this study | N/A | 361,270

Discriminant Analysis | this study | N/A | 4,828

SVM | OSU SVM Toolbox for Matlab | http://sourceforge.net/projects/svm | 827,614

SVM-t* | SVM light [28] | http://svmlight.joachims.org/ | 827,614 to 880,270

Neural Network | Neural Network toolbox for Matlab | N/A | 690

MaxEnt 1 | Maximum Entropy Modeling Toolkit for Python and C++ | http://homepages.inf.ed.ac.uk/s0450736/maxent_toolkit.html | 136

MaxEnt 2 | same as the MaxEnt 1 | same as the MaxEnt 1 | 4,828

MaxEnt 2-v | same as the MaxEnt 1 | same as the MaxEnt 1 | 4,692

Meta-Classifier | OSU SVM Toolbox for Matlab | http://sourceforge.net/projects/svm | > 11,560
Meta-method

We implemented the meta-classifier on the basis of the SVM algorithm (linear kernel with C = 1)
applied to predictions (converted into probabilities that the object belongs to the class correct)
provided by the individual “simple” classifiers. The meta-method used 1,445 support vectors
(1 × 7-dimensional), in addition to combined parameters of the seven individual classifiers used as
input to the meta-classifier.

Implementation 

A  summary  of  the  sources  of  software  used  in  our  study  is  shown  in  Table  4. 

Features  used  in  our  analysis 

We selected 68 individual features covering a range of characteristics that could help in the
classification, see Table 5. To capture the flow of information in a molecular interaction graph (the
edge direction), in each extracted relation we identified an “upstream term” (corresponding to the
graph node with the outgoing directed edge) and a “downstream term” (the node with the incoming
directed edge): for example, in the phrase “JAK phosphorylates p53,” JAK is the upstream term
and p53 is the downstream term. Features in the group keywords represent a list of tokens that may
signal that the sentence is hypothetical, interrogative, or negative, or that there is confusion in the
relation extraction (e.g. the particle “by” in passive-voice sentences). We eventually abandoned
keywords because we found them to be uninformative features, but they are still listed for the sake
of completeness.


Table 5. List of the features that we used in the present study.

Group of features     Feature(s)                                                  Values                  Number of features
Dictionary look-ups   {Upstream, downstream} term can be found in {GenBank,       Binary                  20
                      NCBI taxonomy, LocusLink, SwissProt, FlyBase, drug list,
                      disease list, Specialist Lexicon, Bacteria, English
                      Dictionary}
Word metrics          Length of the sentence (word count)                         Positive integer        1
                      Distance between the upstream and the downstream term       Integer                 1
                      Minimum non-negative word distance between the upstream     Non-negative integer    1
                      and the downstream term
                      Distance between the upstream term and the action           Integer                 1
                      Distance between the downstream term and the action         Integer                 1
Previous scores       Average score of relationships with the same {upstream      Real                    3
                      term, downstream term, action}
                      Count of evaluated relationships with the same {upstream    Positive integer        3
                      term, downstream term, action}
                      Total count of relationships with the same {upstream        Positive integer        3
                      term, downstream term, action}
                      Average score of relationships that share the same pair     Real                    1
                      of upstream and downstream terms
                      Total count of evaluated relationships that share the       Positive integer        1
                      same pair of upstream and downstream terms
                      Total count of relationships with both the same upstream    Positive integer        1
                      and downstream terms
                      Number of relations extracted from the same sentence        Positive integer        1
                      Number of evaluated relations extracted from the same       Positive integer        1
                      sentence
                      Average score of relations from the same sentence           Real                    1
                      Number of relations sharing upstream term in same           Positive integer        1
                      sentence
                      Number of evaluated relations sharing upstream term in      Positive integer        1
                      the same sentence
                      Average score of relations sharing upstream term in same    Real                    1
                      sentence
                      Relations sharing downstream term in the same sentence      Positive integer        1
                      Evaluated relations sharing downstream term in the same     Positive integer        1
                      sentence
                      Average score of relations sharing downstream term in       Real                    1
                      the same sentence
                      Number of relations sharing same action in the same         Positive integer        1
                      sentence
                      Number of evaluated relations sharing action in the same    Positive integer        1
                      sentence
                      Average score of relations sharing action in the same       Real                    1
                      sentence
Punctuation           Number of {periods, commas, semi-colons, colons} in the     Non-negative integer    4
                      sentence
                      Number of {periods, commas, semi-colons, colons} between    Non-negative integer    4
                      upstream and downstream terms
Terms                 Semantic sub-class category of the {upstream, downstream}   Integer                 2
                      term
                      Probability that the {upstream, downstream} term has been   Real                    2
                      correctly recognized
                      Probability that the {upstream, downstream} term has been   Real                    2
                      correctly mapped
Part-of-speech tags   {Upstream, downstream} term is a noun phrase                Binary                  2
                      Action is a verb                                            Binary                  1
Other                 Relationship is negative                                    Binary                  1
                      Action index                                                Positive integer        1
                      Keyword is present                                          Binary                  (not used)

Dictionary lookups are binary features indicating absence or presence of a term in a specific
dictionary. Previous scores are the average scores that a term or an action has in other relations
evaluated. Term-recognition probabilities are generated by the GeneWays pipeline and reflect the
likelihood that a term had been correctly recognized and mapped. Sharing of the same action
(verb) by two different facts within the same sentence occurs in phrases such as A and B were
shown to phosphorylate C. In this example, two individual relations, A phosphorylates C and
B phosphorylates C, share the same verb, phosphorylate. Semantic categories are entities
(semantic classes) in the GeneWays ontology (e.g. gene, protein, geneorprotein). Part-of-speech
tags were generated by the Maximum Entropy tagger, MXPOST [29].




To  represent  the  second-order  features  (pairs  of  features),  we  defined  a  new  feature  as  a  product  of 
the  normalized  values  of  two  features.  We  obtained  the  normalized  values  of  features  by 
subtracting  the  mean  value  from  each  feature  value,  then  dividing  the  result  by  the  standard 
deviation  for  this  feature. 

After  a  number  of  feature-selection  experiments  for  the  MaxEnt  2  method  we  settled  on  using  all 
second-order  features. 
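
The construction of second-order features described above can be sketched as follows. This is a minimal illustration, not the code used in the study: the function names are hypothetical, the use of the population standard deviation is an assumption, and so is the choice to form products only for distinct pairs of features (i < j).

```python
from itertools import combinations
from statistics import mean, pstdev


def normalize(column):
    """Subtract the mean and divide by the standard deviation."""
    mu = mean(column)
    sigma = pstdev(column)  # population standard deviation (an assumption)
    if sigma == 0:
        return [0.0 for _ in column]  # a constant feature carries no signal
    return [(v - mu) / sigma for v in column]


def second_order_features(rows):
    """For each row, return products of all pairs of normalized features."""
    columns = [normalize(col) for col in zip(*rows)]
    pairs = list(combinations(range(len(columns)), 2))  # i < j (an assumption)
    return [[columns[i][r] * columns[j][r] for i, j in pairs]
            for r in range(len(rows))]


# Toy data: three objects described by three first-order features.
rows = [[1.0, 10.0, 0.0], [2.0, 20.0, 1.0], [3.0, 30.0, 0.0]]
products = second_order_features(rows)
print(len(products[0]))  # number of feature pairs per object
```

Normalizing before multiplying keeps features with large numeric ranges (such as sentence length) from dominating the products.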

Separating  data  into  training  and  testing:  Cross-validation 

To evaluate the success of our classifiers we used a 10-fold cross-validation approach, where we
used 9/10 of the data for training and 1/10 for testing. More precisely, given a partition of the
manually evaluated data into 10 equal portions, we created 10 different pairs of training-test
subsets, so that the 10 distinct testing sets put together covered the whole collection of the
manually evaluated sentences. We then used the 10 training-test set pairs to compare all algorithms.
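
The partitioning scheme can be sketched in a few lines. The function name, the shuffling step, and the fixed random seed are illustrative assumptions; the source does not say how the 10 portions were drawn.

```python
import random


def ten_fold_pairs(items, k=10, seed=0):
    """Yield (training, test) pairs whose test sets partition the items."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # fixed seed: an illustrative choice
    folds = [shuffled[i::k] for i in range(k)]  # k near-equal portions
    for i, test in enumerate(folds):
        training = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield training, test


# Toy data: 100 object identifiers standing in for evaluated sentences.
pairs = list(ten_fold_pairs(range(100)))
print(len(pairs), len(pairs[0][0]), len(pairs[0][1]))
```

Because every algorithm is run on the same list of pairs, differences in the scores reflect the algorithms rather than the split.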


Comparison  of  methods:  Receiver  operating  characteristic  (ROC)  scores 

To  quantify  and  compare  success  of  the  various  classification  methods  we  used  receiver  operating 
characteristic  (ROC)  scores,  also  called  areas  under  ROC  curve  [32]. 

An ROC score is computed in the following way. All test-set predictions of a particular
classification method are ordered by the decreasing quality score provided by this method; for
example, in the case of the Clustered Bayes algorithm, the quality score is the posterior probability
that the test object belongs to the class correct. The ranked list is then converted into binary
predictions by applying a decision threshold, T. All test objects with a quality score above T are
classified as correct and all test objects with lower-than-threshold scores are classified as incorrect.
The ROC score is then computed by plotting the proportion of true-positive predictions (in the test
set we know both the correct label and the quality score of each object) against the proportion of
false-positive predictions for the whole spectrum of possible values of T, then integrating the area
under the curve obtained in this way, see Figure 4.
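
The procedure can be sketched as a threshold sweep followed by trapezoidal integration. This is an illustrative implementation under stated assumptions: the function name is hypothetical, tied scores are processed together, and the test set is assumed to contain at least one object of each class.

```python
def roc_score(scores, labels):
    """Area under the ROC curve; labels are True for the class correct."""
    n_pos = sum(1 for y in labels if y)
    n_neg = len(labels) - n_pos
    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    points = [(0.0, 0.0)]  # (false-positive rate, true-positive rate)
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Lower T past every object that shares this score (handles ties).
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1]:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    # Trapezoidal integration of the curve.
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


print(roc_score([0.9, 0.8, 0.7, 0.6], [True, False, True, False]))  # 0.75
```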

The ROC score is an estimate of the probability that the classifier under scrutiny will label
correctly a pair of statements, one of which is from the class correct and one from the class
incorrect [32]. A completely random classifier therefore would have an ROC score of 0.5,
whereas a hypothetical perfect classifier would have an ROC score of 1. It is also possible to
design a classifier that performs less accurately than would one that is completely random; in this
case the ROC score is less than 0.5, which indicates that we can improve the accuracy of the
classifier by simply reversing all predictions.
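
The pairwise interpretation, and the effect of reversing predictions, can be checked directly on a toy example. The function name and the data are hypothetical, and counting tied pairs as one half is a common convention assumed here rather than something stated in the source.

```python
from itertools import product


def pairwise_roc_score(scores, labels):
    """Fraction of (correct, incorrect) pairs ranked in the right order."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p, n in product(pos, neg))  # ties count as one half
    return wins / (len(pos) * len(neg))


# A deliberately poor scorer: most correct objects receive low scores.
scores = [0.2, 0.9, 0.4, 0.7, 0.8]
labels = [True, False, True, False, True]
worse_than_random = pairwise_roc_score(scores, labels)
after_reversal = pairwise_roc_score([-s for s in scores], labels)
print(worse_than_random, after_reversal)
```

Negating the scores reverses every pairwise comparison, so the two values sum to 1: a score below 0.5 becomes a score above 0.5.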
Figure 4. Receiver-operating characteristic (ROC) curves for the classification methods that we used
in the present study. (Vertical axis: true positives.)


In  Figure  4,  we  show  only  the  linear-kernel  SVM  and  the  Clustered  Bayes  44  ROC  curves  to  avoid 
excessive  data  clutter. 
Figure 5. Accuracy of the raw (non-curated) extracted relations in the GeneWays 6.0 database.

The accuracy was computed by averaging over all individual specific information extraction
examples manually evaluated by the human curators. The plot compactly represents both the
per-relation accuracy of the extraction process (indicated with the length of the corresponding bar)
and the abundance of the corresponding relations in the database (represented by the bar color).
There are relations extracted with a high precision; there are also many noisy relationships. The
database accuracy was markedly increased by the automated curation outlined in this study, see
Figure 6.
Figure 6. Accuracy and abundance of the extracted and automatically curated relations.




Figure 6 shows the per-relation accuracy after both information extraction and automated
curation were done. Accuracy is indicated with the length of the relation-specific bars, while the
abundance of the corresponding relations in the manually curated data set is represented by color.
Here, the MaxEnt 2 method was used for the automated curation. The results shown correspond to
a score-based decision threshold set to zero; that is, all negative-score predictions were treated as
“incorrect.” An increase in the score-based decision boundary can raise the precision of the output
at the expense of a decrease in the recall, see Figure 9.


4.  Results 

The raw extracted facts produced by our system are noisy. Although many relation types are
extracted with accuracy above 80%, and even above 90% (see Figure 2), there are particularly
noisy verbs/relations that bring the average accuracy of the “raw” data to about 65%. Therefore,
additional purification of text-mining output, either computational or manual, is indeed important.

The classification problem of separating correctly and incorrectly extracted facts appears to belong
to a class of easier problems. Even the simplest Naive Bayes method had an average ROC score of
0.84, which more sophisticated approaches surpassed to reach almost 0.95. Judging by the
average ROC score, the quality of prediction increased in the following order of methods:
Clustered Bayes 68, Naive Bayes, MaxEnt 1, Clustered Bayes 44, Quadratic Discriminant
Analysis, artificial neural network, support vector machines, and MaxEnt 2/MaxEnt 2-v (see Table
8). The Meta-method was always slightly more accurate than MaxEnt 2, as explained in the legend
to Table 8 and shown in Figure 4.

Table 8 provides a somewhat misleading impression that MaxEnt 2 and MaxEnt 2-v are not
significantly more accurate than their closest competitors (the SVM family), because of the
overlapping confidence intervals. However, when we trace the performance of all classifiers in
individual cross-validation experiments (see Figure 7) it becomes clear that MaxEnt 2 and MaxEnt
2-v outperformed their rivals in every cross-validation experiment. The SVM and artificial neural
network methods performed essentially identically, and were always more accurate than three
other methods: QDA, Clustered Bayes 44, and MaxEnt 1. Finally, the Clustered Bayes 68 and the
Naive Bayes methods were reliably the least accurate of all methods studied.

It is a matter of both academic curiosity and of practical importance to know how the performance
of our artificial intelligence curator compares to that of humans. If we define the correct answer as
a majority vote of the three human evaluators (see Table 6), the average accuracy of MaxEnt 2 is
slightly lower than, but statistically indistinguishable from, that of humans (at the 99% level of
significance, see Table 6; capital letters “A,” “L,” “M,” and “S” hide the real names of the human
evaluators). If, however, in the spirit of Turing’s test of machine intelligence [17], we treat the
MaxEnt 2 algorithm on an equal footing with the human evaluators, compute the average over
predictions of all four anonymous evaluators, and compare the quality of the performance of each
evaluator with regard to the average, MaxEnt 2 always performs slightly more accurately than one
of the human evaluators. (In all cases we compared performance of the algorithm on data that was
not used for its training; see Tables 6 and 7.)
Table 6. Comparison of the performance of human evaluators and of the MaxEnt 2 algorithm.

Evaluator   Correct   Incorrect (Total)   Accuracy   [99% CI]
Batch A
A.          10,981    208 (11,189)        0.981410   [0.978014 0.984628]
L.          10,547    642 (11,189)        0.942622   [0.936902 0.948253]
M.          10,867    322 (11,189)        0.971222   [0.967111 0.975244]
MaxEnt 2    10,537    652 (11,189)        0.941728   [0.935919 0.947359]
Batch B
A.          9,796     430 (10,226)        0.957950   [0.952767 0.962938]
M.          9,898     328 (10,226)        0.967925   [0.963329 0.972325]
S.          9,501     725 (10,226)        0.929102   [0.922453 0.935556]
MaxEnt 2    9,379     847 (10,226)        0.917172   [0.910033 0.924115]

The first column in Table 6 lists all evaluators (four human evaluators, “A”, “L”, “M”, and “S”,
and the MaxEnt 2 classifier). The second column gives the number of correct answers (with
respect to the gold standard) produced by each evaluator. The third column shows the number of
incorrect answers for each evaluator out of the total number of examples (in parentheses). The last
column shows the accuracy and the 99% confidence interval for the accuracy value. The gold
standard was defined as the majority among three human evaluators (examples with uncertain
votes were not considered, so each evaluator’s vote was either strictly negative or strictly positive).
Batches A and B were evaluated by different sets of human evaluators. We computed the binomial
confidence intervals at the $\alpha$-level of significance ($\alpha \times 100\%$ CI) by identifying a pair of
parameter values that separate areas of approximately $(1-\alpha)/2$ at each distribution tail.
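
One way to read this description is as an exact (Clopper-Pearson style) binomial interval, in which each bound is the parameter value that leaves a tail area of (1 − α)/2. The sketch below follows that reading; it is an assumption that this matches the authors' procedure, and all function names are hypothetical.

```python
from math import exp, lgamma, log


def upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), with 0 < p < 1."""
    def pmf(i):
        return exp(lgamma(n + 1) - lgamma(i + 1) - lgamma(n - i + 1)
                   + i * log(p) + (n - i) * log(1 - p))
    if k <= 0:
        return 1.0
    if k > n:
        return 0.0
    if k > n // 2:  # sum whichever side has fewer terms
        return sum(pmf(i) for i in range(k, n + 1))
    return 1.0 - sum(pmf(i) for i in range(k))


def bisect(f, target, increasing):
    """Find p in (0, 1) with f(p) close to target, for monotone f."""
    lo, hi = 0.0, 1.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if (f(mid) < target) == increasing:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def binomial_ci(k, n, alpha=0.99):
    """Interval whose two tails each hold about (1 - alpha) / 2."""
    tail = (1 - alpha) / 2
    lower = 0.0 if k == 0 else bisect(
        lambda p: upper_tail(k, n, p), tail, increasing=True)
    upper = 1.0 if k == n else bisect(
        lambda p: 1.0 - upper_tail(k + 1, n, p), tail, increasing=False)
    return lower, upper


# Evaluator A, batch A: 10,981 correct answers out of 11,189.
print(binomial_ci(10981, 11189))
```

The bounds are found by bisection because the tail probability is monotone in the binomial parameter; the pure-Python sums avoid any dependency beyond the standard library.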




Figure 7. Ranks of all classification methods used in this study in 10 cross-validation experiments.
(Horizontal axis: cross-validation experiment, 1 to 10; vertical axis: ROC score, 0.79 to 0.95.
Methods plotted: MaxEnt 2, SVM-t2-g0.5, SVM-t2-g1, SVM-t1-d3, SVM-t2-g2, SVM-t1-d2, Neural Net,
SVM, SVM-t0, QDA, Clustered Bayes, MaxEnt 1, Naive Bayes.)
Figure 8. Comparison of a correlation matrix for the features

Figure 8 is a comparison of a correlation matrix for the features (colored half of the matrix)
computed using only the annotated set of data and a matrix of mutual information between all
feature pairs and the statement class (correct or incorrect). The plot indicates that a significant
amount of information critical for classification is encoded in pairs of weakly correlated features.
The white dotted lines outline clusters of features, suggested by analysis of the annotated data set;
we used these clusters in implementation of the Clustered Bayes classifier.


Figure 9. Values of precision, recall and accuracy of the MaxEnt 2 classifier plotted against the
corresponding log-scores provided by the classifier. (Horizontal axis: threshold.)


The optimum accuracy was close to 88%, and was attained at a score threshold slightly above 0. We
can improve precision at the expense of accuracy: for example, by setting the threshold score to
0.6702 we can bring the overall database precision to 95%, which would correspond to a recall of
77.91% and to an overall accuracy of 84.18%.
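
The trade-off can be made concrete with a short sketch that computes the three quantities at a given threshold and searches for the lowest threshold that reaches a target precision. The function names and the toy scores are hypothetical; the numbers quoted in the text come from the GeneWays data, not from this example.

```python
def metrics_at_threshold(scores, labels, threshold):
    """Precision, recall, accuracy when scores above threshold are accepted."""
    tp = sum(1 for s, y in zip(scores, labels) if s > threshold and y)
    fp = sum(1 for s, y in zip(scores, labels) if s > threshold and not y)
    fn = sum(1 for s, y in zip(scores, labels) if s <= threshold and y)
    tn = len(labels) - tp - fp - fn
    precision = tp / (tp + fp) if tp + fp else 1.0  # convention: nothing accepted
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / len(labels)
    return precision, recall, accuracy


def lowest_threshold_for_precision(scores, labels, target):
    """Smallest observed score usable as a threshold that meets the target."""
    for t in sorted(set(scores)):
        precision, recall, _ = metrics_at_threshold(scores, labels, t)
        if recall > 0 and precision >= target:  # skip empty accepted sets
            return t
    return None


# Toy scores; True marks a correctly extracted fact.
scores = [-1.2, -0.3, 0.1, 0.4, 0.8, 1.5]
labels = [False, True, False, True, True, True]
print(metrics_at_threshold(scores, labels, 0.0))
print(lowest_threshold_for_precision(scores, labels, 0.95))
```

Raising the threshold shrinks the accepted set, which tends to raise precision and lower recall, mirroring the behavior described for Figure 9.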
Table 7. Comparison of human evaluators and a program that mimicked their work.

Evaluator   Correct   Incorrect (Total)   Accuracy   [99% CI]
Batch A
A.          10,700    182 (10,882)        0.983275   [0.980059 0.986400]
L.          10,452    430 (10,882)        0.960485   [0.955615 0.965172]
M.          10,629    253 (10,882)        0.976751   [0.972983 0.980426]
MaxEnt 2    10,537    345 (10,882)        0.968296   [0.963885 0.972523]
Batch B
A.          9,499     363 (9,862)         0.963192   [0.958223 0.967958]
M.          9,636     226 (9,862)         0.977084   [0.973130 0.980836]
S.          9,332     530 (9,862)         0.946258   [0.940276 0.952038]
MaxEnt 2    9,379     483 (9,862)         0.951024   [0.945346 0.956500]

The first column in Table 7 lists all evaluators (four human evaluators, “A”, “L”, “M”, and “S”,
and the MaxEnt 2 classifier). The second column gives the number of correct answers (with
respect to the gold standard) produced by each evaluator. The third column shows the number of
incorrect answers for each evaluator out of the total number of examples (in parentheses). The last
column shows the accuracy and the 99% confidence interval for the accuracy value. The gold
standard was defined as the majority among three human evaluators and the MaxEnt 2 algorithm.
We did not include evaluation ties (two positive and two negative evaluations for the same
statement-sentence pair) in the gold standard, which explains the difference in the number of the
statement-sentence pairs used in the 3-evaluator-gold-standard and 4-evaluator-gold-standard
experiments. The even (2-by-2) evaluator splits are clearly uninformative in assessing the relative
performance of our evaluators because all four evaluators get an equal penalty for each tie case.
Batches A and B were evaluated by different sets of human evaluators. We computed the binomial
confidence intervals at the $\alpha$-level of significance ($\alpha \times 100\%$ CI) by identifying a pair of
parameter values that separate areas of approximately $(1-\alpha)/2$ at each distribution tail.

The features that we used in our analysis are obviously not all equally important. To elucidate the
relative importance of the individual features and of feature pairs, we computed the mutual
information between all pairs of features and the class variable (see Figure 8). The mutual
information of the class variable, $C$, and a pair of feature variables, $(F_i, F_j)$, is defined in the
following way (e.g., see [23,30]):

$$I(F_i, F_j; C) = H(F_i, F_j) + H(C) - H(C, F_i, F_j), \tag{14}$$

where function $H(P(x))$ is Claude E. Shannon’s entropy of distribution $P(x)$ (see p. 14 of [31]),
defined in the following way:

$$H(P(x)) = -\sum_{x} P(x) \log P(x),$$

where summation is done over all admissible values of $x$. Figure 8 shows that the most
informative standalone features, as expected, are those that are derived from the human evaluations
of the quality of extraction of individual relations and terms (such as the average quality scores),
and features reflecting properties of the sentence that was used to extract the corresponding fact. In
addition, some dictionary-related features, such as finding a term in LocusLink, are fairly
informative. Some features, however, become informative only in combination with other features.
For example, the minimum positive distance between two terms in a sentence is not very
informative by itself, but becomes fairly useful in combination with other features, such as the
number of commas in the sentence, or the length of the sentence (see Figure 8). Similarly, while
finding a term in GenBank does not help the classifier by itself, the feature becomes informative in
combination with syntactic properties of the sentence and statistics about the manually evaluated
data.
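
The way a pair of features can be informative when neither feature is informative alone can be illustrated with a small sketch of the entropy-based definition of mutual information. The function names, the use of base-2 logarithms, and the toy data are assumptions made for illustration; the features are assumed to be discrete (or already discretized).

```python
from collections import Counter
from math import log2


def entropy(samples):
    """Shannon entropy (in bits) of the empirical distribution of samples."""
    n = len(samples)
    return -sum((c / n) * log2(c / n) for c in Counter(samples).values())


def pair_class_information(fi, fj, c):
    """I(F_i, F_j; C) = H(F_i, F_j) + H(C) - H(C, F_i, F_j)."""
    return (entropy(list(zip(fi, fj))) + entropy(list(c))
            - entropy(list(zip(c, fi, fj))))


def single_class_information(f, c):
    """I(F; C) = H(F) + H(C) - H(C, F)."""
    return entropy(list(f)) + entropy(list(c)) - entropy(list(zip(c, f)))


# Toy data: the class is determined by the pair, not by either feature alone.
fi = [0, 0, 1, 1]
fj = [0, 1, 0, 1]
c = [a ^ b for a, b in zip(fi, fj)]
print(single_class_information(fi, c), pair_class_information(fi, fj, c))
```

In this toy case each feature alone carries no information about the class, while the pair carries one full bit.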

Assignment of facts to classes correct and incorrect by evaluators is subject to random errors.
Facts that were seen by many evaluators would be assigned to the appropriate class with higher
probability than facts that were seen by only one evaluator. This introduction of noise affects
directly the estimate of the accuracy of an artificial intelligence curator. If the gold standard is
noisy, the apparent accuracy of the algorithm compared to the gold standard is lower than the real
accuracy. Indeed, the three-evaluator gold standard (see Table 6) indicated that the actual optimum
accuracy of the MaxEnt 2 classifier is higher than 88%. (The 88% accuracy estimate came
from comparison of MaxEnt 2 predictions to the whole set of annotated facts, half of which were
seen by only one or two evaluators, see Figure 9.) When MaxEnt 2 was compared with the
three-human gold standard, the estimated accuracy was about 91%.
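
The effect of a noisy gold standard can be illustrated with a simple model. The sketch assumes two classes and assumes that gold-standard errors occur independently of the classifier's errors; the function name and the error rates are hypothetical and are not estimates from the study.

```python
def apparent_accuracy(true_accuracy, label_error_rate):
    """Expected agreement with a gold standard that is wrong at a given rate."""
    a, e = true_accuracy, label_error_rate
    # Agreement happens when both are right or both are wrong (two classes).
    return a * (1 - e) + (1 - a) * e


for rate in (0.0, 0.02, 0.05):
    print(rate, apparent_accuracy(0.9, rate))
```

Under these assumptions, any classifier that is better than random appears less accurate as the gold standard gets noisier, which agrees with the direction of the effect described above.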
5. Discussion


As evidenced by Figures 2 and 3, the results of our study are directly applicable to analysis of large
text-mined databases of molecular interactions. We can identify sets of molecular interactions with
any pre-defined level of precision (see Figure 9). For example, we can request from a database all
interactions with extraction precision 95% or greater, which would result in the case of the
GeneWays 6.0 database in recall of 77.9%. However, we are not forced to discard the unrequested
lower-than-threshold-precision interactions. Intuitively, even weakly supported facts (i.e. those on
which there is not full agreement) can be useful in interpreting experimental results, and may gain
additional support when studied in conjunction with other related facts (see Figure 1 for examples
of weakly supported yet useful facts, such as cocaine stimulates prolactin, which has a low
extraction confidence but is biologically plausible; the accuracy predictions were computed using
the MaxEnt 2 method). We envision that, in the near future, we will have computational
approaches, such as probabilistic logic, that will allow us to use weakly supported facts for building
a reliable model of molecular interactions from unreliable facts (paraphrasing John von Neumann’s
“synthesis of reliable organisms from unreliable components” [18]).

Experiments with any standalone set of data generate results insufficient to allow us to draw
conclusions about the general performance of different classifiers. Nevertheless, we can speculate
about the reasons for the observed differences in performance of the methods when applied to our
data. The modest performance of the Naive Bayes classifier is unsurprising: we know that many
pairs of features used in our analysis are highly or weakly correlated (see Figures 8 and 9). The
actual feature dependencies violate the method’s major assumption about the conditional
independence of features. MaxEnt 1 performed significantly more accurately than the Naive Bayes
in our experiments, but was not as efficient as other methods. It takes into account only the
class-specific mean values of features. It does not incorporate parameters to reflect dependencies
between individual features. This deficiency of MaxEnt 1 is compensated by MaxEnt 2, which has
an additional set of parameters for pairs of features leading to a markedly improved performance.

Our explanation for the superior performance of the MaxEnt 2 algorithm with respect to the
remainder of the algorithms in the study batch is that MaxEnt 2 requires the least parameter
tweaking in comparison to other methods of similar complexity. Performance of the Clustered
Bayes method is highly sensitive to the definition of feature clusters and to the way we discretize
the feature values, essentially presenting the problem of selecting an optimal model from an
extensive set of rival models, each model defined by a specific set of feature clusters. Our initial
intuition was that a reasonable choice of clusters can become clear from analysis of an estimated
feature-correlation matrix. We originally expected that more highly correlated parameters would
belong to the same cluster. However, the correlation matrices estimated from the complete
GeneWays 6.0 database and from a subset of annotated facts turned out to be rather different (see
Figure 8), suggesting that we could group features differently. In addition, analysis of mutual
information between the class of a statement and pairs of features (see Figure 8) indicated that the
most informative pairs of features are often only weakly correlated. It is quite likely that the
optimum choice of feature clusters in the Clustered Bayes method would lead to classifier
performance accuracy significantly higher than that of MaxEnt 2 in our study, but the road to this
improved classifier lies through a search in an astronomically large space of alternative models.

Similar to optimizing the Clustered Bayes algorithm through model selection, we can experiment
with various kernel functions in the SVM algorithm, and can try alternative designs of the artificial
neural network. These optimization experiments are likely to be computationally expensive, but
are almost certain to improve the prediction quality. Furthermore, there are bound to exist
additional useful classification features waiting to be discovered in future analyses. Finally, we
speculate that we can improve the quality of the classifier by increasing the number of human
evaluators who annotate each data point in the training set. This would allow us to improve the
gold standard itself, and could lead to development of a computer program that performs the
curation job consistently and at least as accurately as an average human evaluator.

Table 8. The receiver operating characteristic (ROC) scores

Method                ROC score ± 2σ
Clustered Bayes 68    0.8115 ± 0.0679
Naive Bayes           0.8409 ± 0.0543
MaxEnt 1              0.8647 ± 0.0412
Clustered Bayes 44    0.8751 ± 0.0414
QDA                   0.8826 ± 0.0445
SVM-t0                0.9203 ± 0.0317
SVM                   0.9222 ± 0.0299
Neural Network        0.9236 ± 0.0314
SVM-t1-d2             0.9277 ± 0.0285
SVM-t2-g2             0.9280 ± 0.0285
SVM-t1-d3             0.9281 ± 0.0280
SVM-t2-g1             0.9286 ± 0.0283
SVM-t2-g0.5           0.9287 ± 0.0285
MaxEnt 2              0.9480 ± 0.0178
MaxEnt 2-v            0.9492 ± 0.0156

Table 8 gives the receiver operating characteristic (ROC) scores (also called the area under the
ROC curve) for methods used in this study, with error bars calculated in 10-fold cross-validation.
The Meta-method is much more expensive computationally than the rest of the methods, so we
evaluated it using a smaller data set and the corresponding results are not directly comparable with
those for the other methods. The Meta-method outperformed other methods listed in this table
when trained on the same data (not shown).
6. Conclusions

Text-mining algorithms make mistakes in extracting facts from the natural-language texts. In
biomedical applications, which rely on use of text-mined data, it is critical to assess the quality (the
probability that the message is correctly extracted) of individual facts, so as to resolve data
conflicts and inconsistencies. Using a large set of almost manually produced evaluations (most
facts were independently reviewed more than once, producing independent evaluations), we
implemented and tested a collection of algorithms that mimic human evaluation of facts provided
by an automated information-extraction system. The performance of our best automated classifiers
closely approached that of our human evaluators (ROC score close to 0.95). Our hypothesis is that,
were we to use a larger number of human experts to evaluate any given sentence, we could
implement an artificial-intelligence curator that would perform the classification job at least as
accurately as an average individual human evaluator.